Skip to content

Conversation

@cj-zhukov
Copy link
Contributor

…18142)

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Jefffrey
Copy link
Contributor

Would you be able to provide a high level overview of the changes made here? The PR seems pretty big and it's hard to review without some description of the changes made

@cj-zhukov
Copy link
Contributor Author

High-Level Overview

This PR consolidates multiple standalone example binaries into a single example binary that uses subcommands to run individual examples. The main goal is to reduce the number of distinct binaries and make the examples easier to organize and maintain.

Key Changes

  • Grouped similar examples under common folders (e.g. sql_ops, datafusion, etc.).

  • Each example can now be launched via a subcommand.

Example Usage

To run a specific example, use:

cargo run --example datafusion -- datafusion

@cj-zhukov
Copy link
Contributor Author

I’ve noticed that the CI cargo audit check is failing due to the following advisories:

  • RUSTSEC-2025-0111 (tokio-tar)
  • RUSTSEC-2024-0014 (generational-arena)
  • RUSTSEC-2024-0436 (paste)

My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project.

Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)?
I just wanted to make sure I don’t make unnecessary config changes in this PR.

@cj-zhukov cj-zhukov force-pushed the cj-zhukov/Reduce-the-number-of-distinct-binaries branch from bc2f0a6 to 0a9cbe2 Compare October 28, 2025 05:39
@cj-zhukov
Copy link
Contributor Author

I’ve noticed that the CI cargo audit check is failing due to the following advisories:

* RUSTSEC-2025-0111 (tokio-tar)

* RUSTSEC-2024-0014 (generational-arena)

* RUSTSEC-2024-0436 (paste)

My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project.

Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)? I just wanted to make sure I don’t make unnecessary config changes in this PR.

It was fixed: #18288

@cj-zhukov
Copy link
Contributor Author

The merge conflict in encrypted.rs has been resolved, and all checks have passed. Ready for review.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up.

I think rust_example.sh needs to be fixed to run the examples in this new format.

At least from the compilation artifacts it seems a nice win:

# off main
datafusion (main)$ du -s -h ~/.cargo_target_cache/ci/examples/
5.0G    /Users/jeffrey/.cargo_target_cache/ci/examples/
# off this PR
datafusion (pr_18289)$ du -s -h ~/.cargo_target_cache/ci/examples/
1.4G    /Users/jeffrey/.cargo_target_cache/ci/examples/


//! # Advanced UDF/UDAF/UDWF/Asynchronous UDF Examples
//!
//! This example demonstrates advanced user-defined functions in DataFusion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This example demonstrates advanced user-defined functions in DataFusion.
//! These examples demonstrates advanced user-defined functions in DataFusion.

#[tokio::main]
async fn main() -> Result<()> {
let arg = std::env::args().nth(1).ok_or_else(|| {
eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf|udwf|async_udf]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's better to construct the options from ExampleKind so if we add examples in future there's one less place to forget to update 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to uplift this doc to be a table instead, so it's easier at a glance to see which command to run for which example, e.g.

Group Example
advanced_udaf udf
advanced_udaf udwf
query_planning analyzer_rule

Or would this cause too much duplication of information and be prone to drift?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes a lot of sense to create a table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?

AsyncUdf,
}

impl FromStr for ExampleKind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there's a way to templatize/abstract this whole file so they don't drift in structure from the others 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - indeed, the example entry files follow a similar pattern. It could make sense to refactor this into a shared helper or macro in the future to keep them in sync. For now, I kept this consistent with the existing example structure.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @cj-zhukov -- this is an exciting idea but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contingiuous review time, which is quite hard to find)

Is there any way to break it up into smaller pieces?

For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?

Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?

@xudong963 xudong963 changed the title Reduce the number of distinct binaries to reduce CI / compile time (#… Reduce the number of distinct binaries to reduce CI / compile time Nov 1, 2025
@cj-zhukov
Copy link
Contributor Author

Thank you @cj-zhukov -- this is an exciting idea but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contingiuous review time, which is quite hard to find)

Is there any way to break it up into smaller pieces?

For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?

Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples

Thank you for the feedback! That makes perfect sense - I agree it’ll be much easier to review and discuss the approach in smaller steps.
I’ll start by creating a separate PR that consolidates the flight examples into a single example binary using this pattern. Once that’s reviewed and agreed upon, I can apply the same structure to the remaining examples.

@cj-zhukov
Copy link
Contributor Author

@Jefffrey @alamb That’s a great point - I agree that consolidating further could simplify maintenance and make it easier for new users to explore examples.

For now, I’ll start with a smaller PR focusing on the flight examples to confirm the pattern.
Once that’s reviewed, I’ll explore combining related groups (for example, all UDF examples into a single udf_examples binary).

This way, we can gradually reduce the number of binaries while keeping reviews manageable.

@alamb
Copy link
Contributor

alamb commented Nov 3, 2025

@Jefffrey @alamb That’s a great point - I agree that consolidating further could simplify maintenance and make it easier for new users to explore examples.

Thank you 🙏

github-merge-queue bot pushed a commit that referenced this pull request Nov 4, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- part of ##18142.

## Rationale for this change
As discussed in #18289 this PR
is for consolidating all the `flight` examples into a single example
binary. Then we can make sure we are agreed on the pattern and then we
can apply it to the remaining examples

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Sergey Zhukov <szhukov@aligntech.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
jizezhang pushed a commit to jizezhang/datafusion that referenced this pull request Nov 5, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- part of #apache#18142.

## Rationale for this change
As discussed in apache#18289 this PR
is for consolidating all the `flight` examples into a single example
binary. Then we can make sure we are agreed on the pattern and then we
can apply it to the remaining examples

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Sergey Zhukov <szhukov@aligntech.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Nov 7, 2025

I think the @cj-zhukov is breaking this one up into more smaller PRs so marking this one as draft

@alamb alamb marked this pull request as draft November 7, 2025 21:45
@cj-zhukov
Copy link
Contributor Author

I think the @cj-zhukov is breaking this one up into more smaller PRs so marking this one as draft

Thanks, @alamb ! Yes, that’s correct - I’m currently splitting this work into smaller PRs. I’ll continue updating and consolidating all the remaining examples as part of this effort.

codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- part of #apache#18142.

## Rationale for this change
As discussed in apache#18289 this PR
is for consolidating all the `flight` examples into a single example
binary. Then we can make sure we are agreed on the pattern and then we
can apply it to the remaining examples

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Sergey Zhukov <szhukov@aligntech.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce the number of distinct binaries to reduce CI / compile time

3 participants